PACKAGE INSTALLATIONS

I carried out the entire attempt to predict the Hotel Occupancy Rate (TPK BPS) in Jupyter Notebook 6.1.6 on Python 3.8.2 (x64) for Windows.

These are the libraries I used in this competition:

  • Pandas, for processing data in table-like form
  • Numpy, for processing data in array-like form
  • Scikit-learn, for the machine learning tasks
  • Plotly, for data graphing
  • Matplotlib, for data plotting

The committee provided only two series: the Daily Hotel Occupancy Rate retrieved online (tpk_harian) as the X variable and the Monthly Hotel Occupancy Rate published by BPS (tpk_bps) as the Y variable. Because this was too little data, I collected additional variables from other sources:

  • covid_harian_aktif = daily active COVID-19 cases, retrieved from KawalCovid19
  • covid_harian = daily new COVID-19 cases, retrieved from KawalCovid19
  • covid_total = total COVID-19 cases at the end of the month (last day), retrieved from KawalCovid19
  • penerbangan = the number of flight passengers to Bali, retrieved from bali.bps.go.id
  • wisatawan = the number of domestic tourists coming to Bali, retrieved from bps.go.id and from a contact person at Disparda Bali (Dinas Pariwisata Bali)
  • wisatawan_mancanegara = the number of foreign tourists coming to Bali, retrieved from bali.bps.go.id
  • tpk_bps_arima = the Monthly Hotel Occupancy Rate published by BPS (tpk_bps) for the previous month (y-1)
  • hari = the number of days in the month
  • mobility = Google mobility data index for Indonesia as a whole (not Bali in particular), retrieved from OurWorldInData.org

However, after numerous trials, I found that only three independent variables (tpk_online, penerbangan, and wisatawan) significantly reduced the best model's RMSE.

As for the models, here are the ones I tried running on the data:

  • Linear Regression
  • Ridge Regression
  • Random Forest Regressor
  • Support Vector Regression (SVR)
  • K-Nearest Neighbor Regressor
  • MLPRegressor (Neural Network Regression)
  • Lasso Regression
  • Decision Tree Regressor

The complete source code for pre-processing the other, insignificant variables and for the other, unsuccessful models can be found in Nofriani_model_trials_v999991.ipynb.

READING CSV FILES

DATA DENOISING

CALCULATE DAILY OCCUPANCY RATE (tpk_harian)

The original data does not include tpk_online. The function below calculates it from the available rooms and total rooms in the data.
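
As a rough illustration of this calculation, here is a minimal sketch. It assumes the occupancy rate is the share of rooms that are not available, expressed as a percentage. The column names available_rooms and total_rooms and the helper add_daily_occupancy are hypothetical; the actual column names in the competition data may differ.

```python
import pandas as pd


def add_daily_occupancy(df, available_col="available_rooms", total_col="total_rooms"):
    """Return a copy of df with a tpk_online column (percent of rooms occupied).

    The column names are assumptions made for this sketch.
    """
    out = df.copy()
    # Treat zero or negative totals as missing to avoid division by zero.
    total = out[total_col].where(out[total_col] > 0)
    out["tpk_online"] = (total - out[available_col]) / total * 100
    return out


# Made-up sample rows, only to show the shape of the input.
sample = pd.DataFrame({"available_rooms": [20, 5, 0], "total_rooms": [100, 50, 0]})
result = add_daily_occupancy(sample)
```

Rows with no usable room total end up as missing values in tpk_online, so they can be dropped or handled later during denoising.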

OUTLIER DETECTION AND REMOVAL ON tpk_harian

CALCULATE MONTHLY DATA ON tpk_harian

For now the mean is used. If another aggregation method seems more representative, it can be tried here.
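
A minimal sketch of such a monthly aggregation is shown here, with the aggregation method left as a parameter so that alternatives to the mean (for example the median) are easy to try. The date column name, the helper monthly_aggregate, and the sample values are assumptions for illustration only.

```python
import pandas as pd


def monthly_aggregate(df, date_col="date", value_col="tpk_online", how="mean"):
    """Collapse daily values into one row per calendar month."""
    month = df[date_col].dt.to_period("M")
    return (
        df.groupby(month)[value_col]
        .agg(how)
        .reset_index()
        .rename(columns={date_col: "month"})
    )


# Made-up daily values, not taken from the competition data.
daily = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03", "2020-02-01", "2020-02-02"]
    ),
    "tpk_online": [40.0, 60.0, 95.0, 10.0, 30.0],
})

monthly_mean = monthly_aggregate(daily, how="mean")
monthly_median = monthly_aggregate(daily, how="median")
```

Comparing the two results side by side gives a quick sense of how sensitive the monthly figure is to extreme daily values.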

MERGE DATA FRAME

EXPORT DATA FRAME

EXPLORATORY DATA ANALYSIS (EDA)

This step gives a rough view of the relationship between tpk_bps (the Y variable) and each independent variable. The scatter plot library (imported at the beginning of this source file) is used for this purpose.

PREPARATION FOR VARIABLE CHOICES

The following tests the correlation between the Y variable and each X variable.

CORRELATION BETWEEN X AND Y
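
One way to run this check is sketched here, assuming all variables have already been merged into a single monthly data frame. It uses the Pearson correlation, which is an assumption; the helper correlations_with_target is hypothetical and the numbers are made up purely to make the example runnable.

```python
import pandas as pd


def correlations_with_target(df, target="tpk_bps"):
    """Pearson correlation of every other column with the target, strongest first."""
    corr = df.corr(method="pearson")[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Illustrative values only, not the competition data.
data = pd.DataFrame({
    "tpk_bps": [10.0, 20.0, 30.0, 40.0],
    "tpk_online": [12.0, 19.0, 33.0, 41.0],
    "penerbangan": [100.0, 220.0, 310.0, 390.0],
    "hari": [31.0, 28.0, 31.0, 30.0],
})

corr_with_y = correlations_with_target(data)
```

Sorting by absolute value keeps strongly negative correlations near the top, since they can be as informative as positive ones.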

COMBINING DATA_X FOR MODEL TRAINING (model.fit)

PREPARE TEST DATA FOR TPK PREDICTION IN JANUARY TO JUNE 2021

tpk_online TEST

penerbangan TEST

wisatawan TEST

MERGE DATA TEST

PREPARING TRUE Y VALUE (TRUE tpk_bps)

BUILDING MODEL WITH SCIKIT LEARN

The following are the models I tried running on the data:

  • Linear Regression
  • Ridge Regression
  • Random Forest Regressor
  • Support Vector Regression (SVR)
  • K-Nearest Neighbor Regressor
  • MLPRegressor (Neural Network Regression)
  • Lasso Regression
  • Decision Tree Regressor

However, SVR gave the best RMSE in my project, so this file includes only the SVR code. The code for the other models can be found in Nofriani_model_trials_v999991.ipynb.

SUPPORT VECTOR REGRESSION
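
The general shape of fitting an SVR and scoring it by RMSE with scikit-learn can be sketched as follows. This is not the tuned model from this project: the feature scaling step, the kernel, and the values of C and epsilon are assumptions, and the data is randomly generated with three columns standing in for tpk_online, penerbangan, and wisatawan.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data: 24 training months and 6 test months, 3 features.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 100, size=(24, 3))
y_train = 0.5 * X_train[:, 0] + 0.2 * X_train[:, 1] + rng.normal(0, 1, 24)
X_test = rng.uniform(0, 100, size=(6, 3))
y_test = 0.5 * X_test[:, 0] + 0.2 * X_test[:, 1] + rng.normal(0, 1, 6)

# SVR is sensitive to feature scale, so the features are standardized first.
# The hyperparameters below are placeholders, not the values used in this project.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0, epsilon=0.1))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
```

Wrapping the scaler and the regressor in one pipeline ensures the test data is scaled with statistics learned from the training data only.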